The goal of this project is to analyze and estimate the factors associated with obesity. Obesity is one of the major health problems of the current generation and has been growing steadily. Many scientific publications over the past few years concentrate specifically on the factors included in this project's dataset, which is also used widely in industry to build better tools.
We chose this research topic after analyzing facts from the WHO about obesity [2].
The source of our research project is the paper published by Fabio Mendoza Palechor, which presents data for the estimation of obesity levels in individuals from Mexico, Peru and Colombia [1]; we acquired the dataset from the UCI Machine Learning Repository [3]. The dataset contains numerical, ordinal and categorical attributes and has sufficient observations for a regression study, fulfilling this course's requirement. The dataset is distributed as comma-separated values with 17 variables (attributes) and 2,111 records; it is available in CSV format as well as in Attribute-Relation File Format (ARFF) for use with the Weka machine learning software.
The dataset can be accessed and downloaded by following this link.
Why are you creating a model for this data?
Although predicting an individual’s weight might be considered an uncommon or even exceptional ask, estimating it can prove beneficial in certain cases:

- Imputing weight: the model provides a better source for imputing a person’s weight when the other dietary and social predictors are present.
- Industrial and commercial organizations might still be able to estimate the weights of their customers if, for some reason, HIPAA considerations only allow access to the predictors selected in our model. Airline, marketing and pharmaceutical companies would not be precluded from conducting studies, tests and proof-of-concept experiments in which weight is vital but not directly available.
What is the goal of this model?
The goal of the model is to estimate the average weight of an individual given information about the individual’s dietary, social and other physical characteristics.
Primarily, our aim is to explore the following topics, among others:
In addition to predicting weight, our research metrics on obesity and body mass index (BMI) will be compared with the baselines published by Mendoza Palechor [1] and the WHO [2].
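Since BMI is central to those comparisons, here is a minimal sketch (assuming the dataset's Weight in kilograms and Height in metres, and the standard WHO cut-points; `bmi_category` is a hypothetical helper name, not part of our pipeline):

```r
# Sketch: compute BMI from Weight (kg) and Height (m) and bin it into the
# standard WHO categories (cut-points 18.5, 25, 30).
bmi_category <- function(weight, height) {
  bmi <- weight / height^2
  cut(bmi,
      breaks = c(-Inf, 18.5, 25, 30, Inf),
      labels = c("Underweight", "Normal", "Overweight", "Obese"),
      right = FALSE)
}
bmi_category(64, 1.62)   # 64 / 1.62^2 is about 24.4, i.e. "Normal"
```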
This project report encompasses the following topics:
Load dataset
obesity=readRDS("data/obesity.Rda")
dim(obesity)
## [1] 2111 17
Provided predictors
str(obesity)
## 'data.frame': 2111 obs. of 17 variables:
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 2 2 2 ...
## $ Age : num 21 21 23 27 22 29 23 22 24 22 ...
## $ Height : num 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
## $ Weight : num 64 56 77 87 89.8 53 55 53 64 68 ...
## $ family_history_with_overweight: Factor w/ 2 levels "yes","no": 1 1 1 2 2 2 1 2 1 1 ...
## $ FAVC : Factor w/ 2 levels "yes","no": 2 2 2 2 2 1 1 2 1 1 ...
## $ FCVC : num 2 3 2 3 2 2 3 2 3 2 ...
## $ NCP : Factor w/ 4 levels "1","2","3","4": 3 3 3 3 1 3 3 3 3 3 ...
## $ CAEC : Factor w/ 4 levels "no","Sometimes",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ SMOKE : Factor w/ 2 levels "yes","no": 2 1 2 2 2 2 2 2 2 2 ...
## $ CH2O : num 2 3 2 2 2 2 2 2 2 2 ...
## $ SCC : Factor w/ 2 levels "yes","no": 2 1 2 2 2 2 2 2 2 2 ...
## $ FAF : num 0 3 2 2 0 0 1 3 1 1 ...
## $ TUE : num 1 0 1 0 0 0 0 0 1 1 ...
## $ CALC : Factor w/ 4 levels "no","Sometimes",..: 1 2 3 3 2 2 2 2 3 1 ...
## $ MTRANS : Factor w/ 5 levels "Automobile","Motorbike",..: 4 4 4 5 4 1 2 4 4 4 ...
## $ NObeyesdad : Factor w/ 7 levels "Insufficient_Weight",..: 2 2 2 3 4 2 2 2 2 2 ...
Summary of each variable
summary(obesity)
## Gender Age Height Weight
## Female:1043 Min. :14.0 Min. :1.45 Min. : 39.0
## Male :1068 1st Qu.:19.9 1st Qu.:1.63 1st Qu.: 65.5
## Median :22.8 Median :1.70 Median : 83.0
## Mean :24.3 Mean :1.70 Mean : 86.6
## 3rd Qu.:26.0 3rd Qu.:1.77 3rd Qu.:107.4
## Max. :61.0 Max. :1.98 Max. :173.0
##
## family_history_with_overweight FAVC FCVC NCP
## yes:1726 yes:1866 Min. :1.00 1: 395
## no : 385 no : 245 1st Qu.:2.00 2: 285
## Median :2.39 3:1362
## Mean :2.42 4: 69
## 3rd Qu.:3.00
## Max. :3.00
##
## CAEC SMOKE CH2O SCC FAF
## no : 51 yes: 44 Min. :1.00 yes: 96 Min. :0.000
## Sometimes :1765 no :2067 1st Qu.:1.58 no :2015 1st Qu.:0.124
## Frequently: 242 Median :2.00 Median :1.000
## Always : 53 Mean :2.01 Mean :1.010
## 3rd Qu.:2.48 3rd Qu.:1.667
## Max. :3.00 Max. :3.000
##
## TUE CALC MTRANS
## Min. :0.000 no : 639 Automobile : 457
## 1st Qu.:0.000 Sometimes :1401 Motorbike : 11
## Median :0.625 Frequently: 70 Bike : 7
## Mean :0.658 Always : 1 Public_Transportation:1580
## 3rd Qu.:1.000 Walking : 56
## Max. :2.000
##
## NObeyesdad
## Insufficient_Weight:272
## Normal_Weight :287
## Overweight_Level_I :290
## Overweight_Level_II:290
## Obesity_Type_I :351
## Obesity_Type_II :297
## Obesity_Type_III :324
Observe Collinearity Issues Among Variables
obesity.pairs = obesity %>% dplyr::select(-Weight)
pairs.panels(obesity.pairs)
# set aside the numeric predictors
obesity_numeric = subset(obesity, select = c("Age", "Height", "NCP", "CH2O", "FAF", "TUE"))
# present the correlation table
kable(cor(obesity_numeric[sapply(obesity_numeric, function(x) !is.factor(x))], method = "pearson"), digits = 2, "html",
caption = "") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = T, font_size = 12)
| | Age | Height | CH2O | FAF | TUE |
|---|---|---|---|---|---|
| Age | 1.00 | -0.03 | -0.05 | -0.14 | -0.30 |
| Height | -0.03 | 1.00 | 0.21 | 0.29 | 0.05 |
| CH2O | -0.05 | 0.21 | 1.00 | 0.17 | 0.01 |
| FAF | -0.14 | 0.29 | 0.17 | 1.00 | 0.06 |
| TUE | -0.30 | 0.05 | 0.01 | 0.06 | 1.00 |
The pair plot above broadly shows the correlation between all the predictors in the obesity dataset.
Data distribution by Age, Gender and Mode of transportation
obesity=obesity%>%mutate(ageBins = cut(Age, breaks = seq(0,100,by=10)))
age_agg=obesity %>% count(ageBins,Gender)
ggplot(age_agg, aes(ageBins, n, fill = Gender)) +
geom_col(position = 'dodge')+
xlab("Age Group")+
ylab("Frequency")+
ggtitle("Age group and Frequency comparison by Gender")
The plot shows that the majority of observations fall in the 10–40 age range, and the 20–30 group has the largest number of records.
age_mot_agg=obesity %>% count(ageBins,MTRANS)
ggplot(age_mot_agg, aes(fill=MTRANS,y=n, x=ageBins)) +
geom_bar(position="stack", stat="identity") +
scale_fill_viridis(discrete = T) +
xlab("Age Group")+
ylab("Frequency")+
ggtitle("Age group and Frequency comparison by Mode of Transport")
The graph shows interesting patterns in mode of transportation: the majority of people under 30 prefer public transportation, while those above 30 tend to prefer the automobile.
Attributes related to eating habits are:
Frequent consumption of high caloric food (FAVC)
Frequency of consumption of vegetables (FCVC)
Number of main meals (NCP)
Consumption of food between meals (CAEC)
Consumption of water daily (CH2O)
Consumption of alcohol (CALC)
p1 = ggboxplot(obesity, x = "family_history_with_overweight", y = "Weight",
color = "family_history_with_overweight", palette = "jco")
p2 = ggboxplot(obesity, x = "Gender", y = "Weight",
color = "Gender", palette = "jco")
p3=ggboxplot(obesity, x = "MTRANS", y = "Weight",
color = "MTRANS", palette = "jco")
p4=ggboxplot(obesity, x = "FAVC", y = "Weight",
color = "FAVC", palette = "jco")+
xlab("FAVC")
p5=ggboxplot(obesity, x = "NCP", y = "Weight",
color = "NCP", palette = "jco")+
xlab("NCP")
p6=ggboxplot(obesity, x = "CAEC", y = "Weight",
color = "CAEC", palette = "jco")+
xlab("CAEC")
p7=ggboxplot(obesity, x = "SMOKE", y = "Weight",
color = "SMOKE", palette = "jco")+
xlab("SMOKE")
ggarrange(p1,p2,p3,nrow=3,ncol=1)
ggarrange(p4,p5,p6,p7,nrow=2,ncol=2)
The plots show that the categorical variables related to eating habits have a relationship with Weight.
ggplot(obesity, aes(x=Age, color=Gender)) +
geom_histogram(fill="white", position="dodge")+
theme(legend.position="top")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p1=ggplot(data = obesity) +
geom_point(mapping = aes(x = Height, y = Weight,color=NCP))+
ggtitle("Height vs Weight by Number of main meals (NCP) ")
p2=ggplot(data = obesity) +
geom_point(mapping = aes(x = Height, y = Weight,color=SMOKE))+
ggtitle("Height vs Weight by Smoking habit")
ggarrange(p1,p2,nrow=2,ncol=1)
ggplot(data = obesity) +
geom_point(mapping = aes(x = Height, y = Weight,color=FAVC))+
ggtitle("Height vs Weight by Frequent consumption of high caloric food")
Attributes related to physical condition are:
ggboxplot(obesity, x = "SCC", y = "Weight",
color = "SCC", palette = "jco")
The plot above suggests that people who monitor their calorie consumption are more likely to be fit.
Coerce Categorical Variables to Factors
The columns and descriptions having been identified, we now proceed to coerce those variables that need to be defined as factor variables to aid in our modeling later.
load_cleansed=function(obesity){
obesity$Gender=as.factor(as.integer(obesity$Gender))
obesity$family_history_with_overweight=as.factor(as.integer(obesity$family_history_with_overweight))
obesity$FAVC=as.factor(as.integer(obesity$FAVC))
obesity$FCVC=as.factor(as.integer(obesity$FCVC))
obesity$NCP=as.factor(as.integer(obesity$NCP))
obesity$CAEC=as.factor(as.integer(obesity$CAEC))
obesity$SMOKE=as.factor(as.integer(obesity$SMOKE))
obesity$SCC=as.factor(as.integer(obesity$SCC))
obesity$FAF=as.factor(as.integer(obesity$FAF))
obesity$TUE=as.factor(as.integer(obesity$TUE))
obesity$CALC=as.factor(as.integer(obesity$CALC))
obesity$MTRANS=as.factor(as.integer(obesity$MTRANS))
obesity$CH2O=as.factor(as.integer(obesity$CH2O))
obesity$NObeyesdad=as.factor(as.integer(obesity$NObeyesdad))
obesity
}
obesity = load_cleansed(obesity)
str(obesity)
## 'data.frame': 2111 obs. of 18 variables:
## $ Gender : Factor w/ 2 levels "1","2": 1 1 2 2 2 2 1 2 2 2 ...
## $ Age : num 21 21 23 27 22 29 23 22 24 22 ...
## $ Height : num 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
## $ Weight : num 64 56 77 87 89.8 53 55 53 64 68 ...
## $ family_history_with_overweight: Factor w/ 2 levels "1","2": 1 1 1 2 2 2 1 2 1 1 ...
## $ FAVC : Factor w/ 2 levels "1","2": 2 2 2 2 2 1 1 2 1 1 ...
## $ FCVC : Factor w/ 3 levels "1","2","3": 2 3 2 3 2 2 3 2 3 2 ...
## $ NCP : Factor w/ 4 levels "1","2","3","4": 3 3 3 3 1 3 3 3 3 3 ...
## $ CAEC : Factor w/ 4 levels "1","2","3","4": 2 2 2 2 2 2 2 2 2 2 ...
## $ SMOKE : Factor w/ 2 levels "1","2": 2 1 2 2 2 2 2 2 2 2 ...
## $ CH2O : Factor w/ 3 levels "1","2","3": 2 3 2 2 2 2 2 2 2 2 ...
## $ SCC : Factor w/ 2 levels "1","2": 2 1 2 2 2 2 2 2 2 2 ...
## $ FAF : Factor w/ 4 levels "0","1","2","3": 1 4 3 3 1 1 2 4 2 2 ...
## $ TUE : Factor w/ 3 levels "0","1","2": 2 1 2 1 1 1 1 1 2 2 ...
## $ CALC : Factor w/ 4 levels "1","2","3","4": 1 2 3 3 2 2 2 2 3 1 ...
## $ MTRANS : Factor w/ 5 levels "1","2","3","4",..: 4 4 4 5 4 1 2 4 4 4 ...
## $ NObeyesdad : Factor w/ 7 levels "1","2","3","4",..: 2 2 2 3 4 2 2 2 2 2 ...
## $ ageBins : Factor w/ 10 levels "(0,10]","(10,20]",..: 3 3 3 3 3 3 3 3 3 3 ...
We also experimented with adjusting some of the categorical variables, particularly MTRANS and NObeyesdad, to reduce the number of factor levels; however, in testing we found that this did not improve our results, so we used the original values for those variables.
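For reference, the level-collapsing we experimented with can be sketched in base R (toy vector below; the real data uses these MTRANS labels before the integer recoding):

```r
# Merge the rare transport modes into a single "Other" level.
mtrans <- factor(c("Automobile", "Motorbike", "Bike",
                   "Public_Transportation", "Walking"))
levels(mtrans)[levels(mtrans) %in% c("Motorbike", "Bike", "Walking")] <- "Other"
levels(mtrans)   # "Automobile" "Other" "Public_Transportation"
```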
Split dataset into train and test
set.seed(04082021)
## 75% of the sample size
smp_size <- floor(0.75 * nrow(obesity))
## set the seed to make your partition reproducible
trn_ind <- sample(seq_len(nrow(obesity)), size = smp_size)
obs_trn <- obesity[trn_ind, ]
obs_test <- obesity[-trn_ind, ]
Useful functions
#' Generate plots for regression assumptions- Linearity, Normality and Equal variance
#' i.e. Fitted vs Residual, QQ Plot, Residuals histogram
#' @param obs_mod A model for plots
plot_diagnostics=function (model){
par(mfrow = c(2, 2))
title=paste("Data from model: \n",summary(model)$call[2], summary(model)$call[3])
plot(fitted(model), resid(model),pch = 20, cex.main=1,cex.lab=1,xlab = "Fitted", ylab = "Residuals")
abline(h = 0, lty = 2, lwd = 2, col="darkorange")
qqnorm(resid(model), main = "Normal Q-Q Plot", col = "darkgrey")
qqline(resid(model), col = "dodgerblue", lwd = 2)
hist(resid(model),
xlab = "Residuals",
main = "Histogram of Residuals",
col = "darkorange",
border = "dodgerblue",
breaks = 20)
}
#' Return model performance metrics
#' R2, Adj R2, Train_RMSE, Test_RMSE, BP test, Shapiro test
#'
#' @param obs_mod the model to compute metrics for
#' @param obs_tst test dataset used to calculate test RMSE; defaults to the obesity test data
mod_performance=function(obs_mod,obs_tst=obs_test){
smry=summary(obs_mod)
#Train RMSE
trn_rmse=sqrt(mean(resid(obs_mod)^2))
pred=predict(obs_mod,newdata=obs_tst)
#test RMSE
tst_rmse=sqrt(mean((obs_tst$Weight-pred)^2))
scores=list(r2=smry$r.squared
,adjr2=smry$adj.r.squared,
Train_rmse=trn_rmse,Test_rmse=tst_rmse,
bptest=bptest(obs_mod)$p.value
,sptest=shapiro.test(resid(obs_mod))$p.value)
unlist(scores, recursive = TRUE, use.names = TRUE)
}
obs_mod_add=lm(Weight~.-NObeyesdad-ageBins,data=obs_trn)
summary(obs_mod_add)
Note: we have removed the derived variables ageBins and NObeyesdad.
mod_performance(obs_mod_add)
Next, we will look at unusual observations:
Outlier analysis using Cook’s distance
sum(cooks.distance(obs_mod_add) > 4 / length(resid(obs_mod_add)))
## [1] 0
There are no outliers identified.
Unusual observations using hat values
# We cannot remove records based on leverage, as doing so would treat critical records as unusual
htval=hatvalues(obs_mod_add)
obs_mod_add_unusual_observations=obs_trn[(htval > 2 * mean(htval)),]
Next, we explore variable selection methods (AIC, BIC, and exhaustive search) to find the best model.
obs_mod_add_aic=step(obs_mod_add,direction="both",trace=0)
obs_mod_add_bic=step(obs_mod_add,direction="both",trace=0,k=log(nrow(obs_trn)))
p1=mod_performance(obs_mod_add_aic)
p2=mod_performance(obs_mod_add_bic)
p1=append(p1,unlist(list(num_predictors=length(coef(obs_mod_add_aic)))))
p2=append(p2,unlist(list(num_predictors=length(coef(obs_mod_add_bic)))))
mod1_stats=data.frame(obs_mod_add_aic=p1,obs_mod_add_bic=p2)
Exhaustive search
library(leaps)
obs_mod_ex_search = summary(regsubsets(Weight ~ .-NObeyesdad-ageBins, data = obesity))
obs_mod_ex_search$which
## (Intercept) Gender2 Age Height family_history_with_overweight2 FAVC2 FCVC2
## 1 TRUE FALSE FALSE FALSE TRUE FALSE FALSE
## 2 TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## 3 TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## 4 TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## 5 TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## 6 TRUE FALSE FALSE TRUE TRUE FALSE FALSE
## 7 TRUE FALSE TRUE TRUE TRUE FALSE FALSE
## 8 TRUE FALSE TRUE TRUE TRUE FALSE FALSE
## FCVC3 NCP2 NCP3 NCP4 CAEC2 CAEC3 CAEC4 SMOKE2 CH2O2 CH2O3 SCC2 FAF1
## 1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 4 TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 5 TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 6 TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 7 TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 8 TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## FAF2 FAF3 TUE1 TUE2 CALC2 CALC3 CALC4 MTRANS2 MTRANS3 MTRANS4 MTRANS5
## 1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 4 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 5 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 6 TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 7 TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## 8 TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
Quantify the effect of collinearity using the Variance Inflation Factor (VIF):
library(faraway)
##
## Attaching package: 'faraway'
## The following object is masked from 'package:psych':
##
## logit
kable(vif(obs_mod_add))
| Variable | VIF |
|---|---|
| Gender2 | 2.117 |
| Age | 1.975 |
| Height | 2.115 |
| family_history_with_overweight2 | 1.356 |
| FAVC2 | 1.186 |
| FCVC2 | 3.101 |
| FCVC3 | 3.250 |
| NCP2 | 1.649 |
| NCP3 | 1.839 |
| NCP4 | 1.330 |
| CAEC2 | 6.891 |
| CAEC3 | 5.889 |
| CAEC4 | 2.226 |
| SMOKE2 | 1.077 |
| CH2O2 | 1.274 |
| CH2O3 | 1.414 |
| SCC2 | 1.134 |
| FAF1 | 1.233 |
| FAF2 | 1.357 |
| FAF3 | 1.300 |
| TUE1 | 1.209 |
| TUE2 | 1.235 |
| CALC2 | 1.302 |
| CALC3 | 1.161 |
| CALC4 | 1.046 |
| MTRANS2 | 1.070 |
| MTRANS3 | 1.056 |
| MTRANS4 | 1.969 |
| MTRANS5 | 1.366 |
wt_all = lm(Weight ~ . -ageBins - NObeyesdad, data = obs_trn)
vif(wt_all)[vif(wt_all) > 5]
## CAEC2 CAEC3
## 6.891 5.889
The above code lists the variables deemed to have high collinearity (VIF > 5). CAEC (consumption of food between meals) has a high VIF. Including these variables might pose some issues in explaining the relationship between the response and the predictors.
It would be no surprise to see BMI among the variables posing a collinearity issue had it been included as a predictor, because it is calculated directly from height and weight and is therefore highly correlated with them.
We retain the CAEC and CALC predictors, as their interactions with other predictors help generate a suitable model for estimating a person’s weight from the remaining predictors.
Further, we check whether a transformation of the response variable is required. We use the Box-Cox method, which considers a family of power transformations on a positive response variable.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(faraway)
boxcox(obs_mod_add, plotit = TRUE,lambda = seq(-1, 1, by = 0.1))
obs_mod_add_trns=lm((((Weight^0.2)-1)/0.2)~Gender+Age+Height+family_history_with_overweight+FAVC+FCVC+CAEC+SMOKE+FAF+CALC+MTRANS,data=obs_trn)
summary(obs_mod_add_trns)$r.squared
## [1] 0.6506
summary(obs_mod_add_trns)$adj.r.squared
## [1] 0.6459
We observe a slight performance improvement, so we will try this approach on further models as well.
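Because predictions from the transformed-response model live on the Box-Cox scale, they must be back-transformed before comparing against weights in kilograms. A minimal sketch, assuming the λ = 0.2 used in the model above (`bc` and `bc_inv` are hypothetical helper names):

```r
# Box-Cox transform at lambda = 0.2 and its inverse.
bc     <- function(y, lambda = 0.2) (y^lambda - 1) / lambda
bc_inv <- function(z, lambda = 0.2) (z * lambda + 1)^(1 / lambda)
bc_inv(bc(86.6))   # recovers the original weight, 86.6
```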
At this stage we fit two-way interaction models and select among them using AIC and BIC:
obs_int_mod2=lm(formula = Weight ~ (. - NObeyesdad -ageBins)^2, data = obs_trn)
obs_int_mod2_aic=step(obs_int_mod2,direction="both",trace=0)
obs_int_mod2_bic=step(obs_int_mod2,direction="both",trace=0,k=log(nrow(obs_trn)))
p1=mod_performance(obs_int_mod2_aic)
## Warning in predict.lm(obs_mod, newdata = obs_tst): prediction from a rank-
## deficient fit may be misleading
p2=mod_performance(obs_int_mod2_bic)
## Warning in predict.lm(obs_mod, newdata = obs_tst): prediction from a rank-
## deficient fit may be misleading
p1=append(p1,unlist(list(num_predictors=length(coef(obs_int_mod2_aic)))))
p2=append(p2,unlist(list(num_predictors=length(coef(obs_int_mod2_bic)))))
mod2_stats=data.frame(obs_int_mod2_aic=p1,obs_int_mod2_bic=p2)
obs_int_mod3=lm(formula = Weight ~ (. - NObeyesdad -ageBins)^3, data = obs_trn)
obs_int_mod3_aic=step(obs_int_mod3,direction="both",trace=0)
obs_int_mod3_bic=step(obs_int_mod3,direction="both",trace=0,k=log(nrow(obs_trn)))
p1=mod_performance(obs_int_mod3_aic)
p2=mod_performance(obs_int_mod3_bic)
p1=append(p1,unlist(list(num_predictors=length(coef(obs_int_mod3_aic)))))
p2=append(p2,unlist(list(num_predictors=length(coef(obs_int_mod3_bic)))))
kable(data.frame(obs_int_mod3_aic=p1,obs_int_mod3_bic=p2))
obs_int_mod4=lm(Weight~Gender + Age + Height + family_history_with_overweight +
FAVC + FCVC + NCP + CAEC + SMOKE + CH2O + SCC + FAF + TUE +
CALC+ Gender + MTRANS+FCVC:MTRANS:family_history_with_overweight:Gender:CALC:FAVC:NCP:SCC:Height + Gender:Height + Gender:family_history_with_overweight +
Gender:FCVC + Gender:NCP + Age:FCVC + Age:NCP +
+ Age:CH2O + Age:FAF + Age:TUE + Age:CALC + Height:family_history_with_overweight +
family_history_with_overweight:CALC ,data=obs_trn)
p4=mod_performance(obs_int_mod4)
p4=append(p4,unlist(list(num_predictors=length(coef(obs_int_mod4)))))
mod4_stats=data.frame(obs_int_mod4=p4)
## r2 adjr2 Train_rmse Test_rmse bptest.BP sptest
## 8.409e-01 8.153e-01 1.046e+01 7.726e+01 3.339e-07 8.806e-14
Note: we omitted Model 3 and Model 4 from our final decision, as they took far too long to execute and did not yield worthwhile results. However, their performance results are included here for reference.
library(caret)
k_fold_modn <- train(
Weight ~ . - NObeyesdad - ageBins, obs_trn,
method = "lm",
trControl = trainControl(
method = "cv", number = 5,
verboseIter = FALSE
)
)
mod_performance(k_fold_modn)
This model’s performance is not as good as the interaction model’s.
So far we have found that Model 2 (obs_int_mod2_aic) has the best performance. At this stage we apply a transformation and remove outliers to see whether there are further improvements.
boxcox(obs_mod_add_aic, plotit = TRUE,lambda = seq(-1, 1, by = 0.1))
obs_mod_int_trans=lm((((Weight^0.2)-1)/0.2)~(.- NObeyesdad - ageBins)^2,data=obs_trn)
boxcox(obs_mod_int_trans, plotit = TRUE)
This shows the transformation was not helpful on the response variable, so we only apply polynomial transformations to the numerical variables. However, our results with polynomial transformations were not better than the untransformed interaction model.
obs_int_mod2_log_poly_2=lm(formula = Weight ~ (. - NObeyesdad - ageBins + I(Height^2) + I(Age^2))^2, data = obs_trn)
obs_int_mod2_log_poly2_aic=step(obs_int_mod2_log_poly_2,direction="both",trace=0)
obs_int_mod2_log_poly2_bic=step(obs_int_mod2_log_poly_2,direction="both",trace=0,k=log(nrow(obs_trn)))
p1=mod_performance(obs_int_mod2_log_poly2_aic)
p2=mod_performance(obs_int_mod2_log_poly2_bic)
p1=append(p1,unlist(list(num_predictors=length(coef(obs_int_mod2_log_poly2_aic)))))
p2=append(p2,unlist(list(num_predictors=length(coef(obs_int_mod2_log_poly2_bic)))))
mod2_stats=data.frame(obs_int_mod2_log_poly2_aic=p1,obs_int_mod2_log_poly2_bic=p2)
kable(mod2_stats)
Note: we omitted these transformed models from our final decision, as they added significant execution time and did not yield worthwhile results. They showed only minor differences in performance metrics while adding complexity, which was not desirable in our final model.
Leverage
# We cannot remove records based on leverage, as doing so would treat critical records as unusual.
htval=hatvalues(obs_mod_int_trans)
obs_leverages=obs_trn[(htval > 2 * mean(htval)),]
Here we found that the high-leverage records contain factor levels that occur only in the training set. Had we removed these records, prediction on the test set would fail, because the test set would contain factor levels unseen during training for some of the categorical variables. Hence we did not remove records with high leverage.
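A toy illustration (hypothetical data, not drawn from the obesity dataset) of why removing those rows would break prediction:

```r
# A factor level that appears only at prediction time makes predict.lm() fail.
trn <- data.frame(y = c(1, 2, 3, 4), g = factor(c("a", "a", "b", "b")))
tst <- data.frame(g = factor("c", levels = c("a", "b", "c")))
fit <- lm(y ~ g, data = trn)
res <- tryCatch(predict(fit, newdata = tst), error = function(e) "new-level error")
res   # "new-level error"
```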
Outliers
outliers=cooks.distance(obs_int_mod2_aic)>4/length(resid(obs_int_mod2_aic))
obs_outliers=obs_trn[outliers,]
In the previous sections we developed multiple models and used various strategies to improve overall model performance. Next, we present our model diagnostics and performance analysis.
Model 1 vs. Model 2 comparison
obs_mod_add — number of parameters: 30
obs_int_mod2_aic — number of parameters: 298
anova(obs_mod_add,obs_int_mod2_aic)[2,"Pr(>F)"]<0.01
## [1] TRUE
We reject the null hypothesis \(H_0\) at significance level \(\alpha = 0.01\) and prefer the interaction model obs_int_mod2_aic over the additive model.
##
## Call:
## lm(formula = Weight ~ . - NObeyesdad - ageBins, data = obs_trn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.57 -10.42 1.55 9.91 50.82
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -147.4342 11.0232 -13.37 < 2e-16 ***
## Gender2 -1.9025 1.1456 -1.66 0.0970 .
## Age 0.7665 0.0860 8.92 < 2e-16 ***
## Height 116.5762 6.1451 18.97 < 2e-16 ***
## family_history_with_overweight2 -17.1378 1.1821 -14.50 < 2e-16 ***
## FAVC2 -5.8039 1.3608 -4.27 2.1e-05 ***
## FCVC2 2.0970 1.4078 1.49 0.1365
## FCVC3 14.0891 1.5270 9.23 < 2e-16 ***
## NCP2 2.4972 1.4729 1.70 0.0902 .
## NCP3 2.4010 1.1191 2.15 0.0321 *
## NCP4 0.0234 2.5966 0.01 0.9928
## CAEC2 3.6112 2.8116 1.28 0.1992
## CAEC3 -15.6130 3.0394 -5.14 3.1e-07 ***
## CAEC4 -6.5168 3.7434 -1.74 0.0819 .
## SMOKE2 -1.7255 2.5729 -0.67 0.5025
## CH2O2 2.0069 0.8944 2.24 0.0250 *
## CH2O3 0.9379 1.7623 0.53 0.5946
## SCC2 5.6855 2.0973 2.71 0.0068 **
## FAF1 2.3175 0.9211 2.52 0.0120 *
## FAF2 -12.2188 1.3441 -9.09 < 2e-16 ***
## FAF3 -4.6587 2.3703 -1.97 0.0495 *
## TUE1 -6.0260 0.9731 -6.19 7.6e-10 ***
## TUE2 -6.0588 1.9519 -3.10 0.0019 **
## CALC2 4.8716 0.9567 5.09 4.0e-07 ***
## CALC3 3.5995 2.4021 1.50 0.1342
## CALC4 10.6735 16.0250 0.67 0.5055
## MTRANS2 10.9665 5.4175 2.02 0.0431 *
## MTRANS3 -1.7875 9.3033 -0.19 0.8477
## MTRANS4 10.5986 1.2712 8.34 < 2e-16 ***
## MTRANS5 0.6782 2.7988 0.24 0.8086
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.7 on 1553 degrees of freedom
## Multiple R-squared: 0.65, Adjusted R-squared: 0.643
## F-statistic: 99.3 on 29 and 1553 DF, p-value: <2e-16
The test statistics and p-values show that most of the variables are important to the model.
Note: we have removed the derived variables ageBins and NObeyesdad.
## r2 adjr2 Train_rmse Test_rmse bptest.BP sptest
## 6.497e-01 6.431e-01 1.552e+01 1.669e+01 1.511e-28 2.229e-08
kable(mod1_stats)
| | obs_mod_add_aic | obs_mod_add_bic |
|---|---|---|
| r2 | 0.6483 | 0.6451 |
| adjr2 | 0.6427 | 0.6403 |
| Train_rmse | 15.5452 | 15.6172 |
| Test_rmse | 16.7503 | 16.7521 |
| bptest.BP | 0.0000 | 0.0000 |
| sptest | 0.0000 | 0.0000 |
| num_predictors | 26.0000 | 22.0000 |
| | obs_mod_add | obs_int_mod2_aic |
|---|---|---|
| r2 | 0.6497 | 0.8750 |
| adjr2 | 0.6431 | 0.8512 |
| Train_rmse | 15.5158 | 9.2679 |
| Test_rmse | 16.6875 | 16.2737 |
| bptest.BP | 0.0000 | 0.0000 |
| sptest | 0.0000 | 0.0000 |
| num_predictors | 30.0000 | 298.0000 |
We check whether a transformation of the response variable is required, using the Box-Cox method, which considers a family of power transformations on a positive response variable.
| | obs_int_mod2_aic | obs_int_mod2_bic |
|---|---|---|
| r2 | 0.8750 | 0.8179 |
| adjr2 | 0.8512 | 0.8066 |
| Train_rmse | 9.2679 | 11.1848 |
| Test_rmse | 16.2737 | 13.6781 |
| bptest.BP | 0.0000 | 0.0000 |
| sptest | 0.0000 | 0.0000 |
| num_predictors | 298.0000 | 109.0000 |
Observations are:
In the Residuals vs. Fitted plot, the residuals do not bounce evenly on both sides of the horizontal zero line, so we question the linearity assumption. We also suspect this model violates the constant-variance assumption, as the spread of the residuals changes as the fitted values increase.
Although the points in the Q-Q plot lie close to the line in the middle, together with the residual histogram and the Shapiro-Wilk test they suggest the residuals do not follow a normal distribution.
| | obs_int_mod2_aic | obs_int_mod2_bic | obs_mod_add_aic | obs_mod_add_bic |
|---|---|---|---|---|
| r2 | 0.8750 | 0.8179 | 0.6483 | 0.6451 |
| adjr2 | 0.8512 | 0.8066 | 0.6427 | 0.6403 |
| Train_rmse | 9.2679 | 11.1848 | 15.5452 | 15.6172 |
| Test_rmse | 16.2737 | 13.6781 | 16.7503 | 16.7521 |
| bptest.BP | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| sptest | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| num_predictors | 298.0000 | 109.0000 | 26.0000 | 22.0000 |
It is really interesting to see how the complex model overfits the data. The AIC-selected interaction model has 298 parameters and the best \(R^2\) and training RMSE, yet it performs poorly on the test dataset (Test_rmse) and is the most complex of all four models. Thus our best selection is Model 2.
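The pattern can be reproduced on synthetic data (a sketch, unrelated to the obesity dataset): adding terms always helps the training fit, which is exactly why training RMSE alone is misleading.

```r
# Synthetic illustration: a needlessly flexible model fits the training data
# at least as well as a simpler nested model, but that gain need not carry
# over to held-out data.
set.seed(1)
x <- runif(60)
y <- 2 * x + rnorm(60, sd = 0.3)
trn <- 1:40                      # training indices
rmse <- function(fit, idx) {
  sqrt(mean((y[idx] - predict(fit, data.frame(x = x[idx])))^2))
}
f_lin  <- lm(y ~ x, subset = trn)
f_poly <- lm(y ~ poly(x, 15), subset = trn)  # deliberately overcomplex
rmse(f_poly, trn) <= rmse(f_lin, trn)   # TRUE: extra terms never hurt in-sample
```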
We used hat values to identify high-leverage points and Cook’s distance to detect outliers in the Methods section; there are 196 outliers and 255 high-leverage points.
Below are models we considered in our process but did not pursue further, as their results were not as good, their complexity was too high, and their runtime was too long to allow further testing in a timely manner.
obs_int_mod3=lm(formula = Weight ~ (. - NObeyesdad -ageBins)^3, data = obs_trn)
obs_int_mod3_aic=step(obs_int_mod3,direction="both",trace=0)
obs_int_mod3_bic=step(obs_int_mod3,direction="both",trace=0,k=log(nrow(obs_trn)))
p1=mod_performance(obs_int_mod3_aic)
p2=mod_performance(obs_int_mod3_bic)
p1=append(p1,unlist(list(num_predictors=length(coef(obs_int_mod3_aic)))))
p2=append(p2,unlist(list(num_predictors=length(coef(obs_int_mod3_bic)))))
kable(data.frame(obs_int_mod3_aic=p1,obs_int_mod3_bic=p2))
obs_int_mod4=lm(Weight~Gender + Age + Height + family_history_with_overweight +
FAVC + FCVC + NCP + CAEC + SMOKE + CH2O + SCC + FAF + TUE +
CALC+ Gender + MTRANS+FCVC:MTRANS:family_history_with_overweight:Gender:CALC:FAVC:NCP:SCC:Height + Gender:Height + Gender:family_history_with_overweight +
Gender:FCVC + Gender:NCP + Age:FCVC + Age:NCP +
+ Age:CH2O + Age:FAF + Age:TUE + Age:CALC + Height:family_history_with_overweight +
family_history_with_overweight:CALC ,data=obs_trn)
p4=mod_performance(obs_int_mod4)
p4=append(p4,unlist(list(num_predictors=length(coef(obs_int_mod4)))))
mod4_stats=data.frame(obs_int_mod4=p4)
## r2 adjr2 Train_rmse Test_rmse bptest.BP sptest
## 8.409e-01 8.153e-01 1.046e+01 7.726e+01 3.339e-07 8.806e-14
obs_int_mod2_log_poly_2=lm(formula = Weight ~ (. - NObeyesdad - ageBins + I(Height^2) + I(Age^2))^2, data = obs_trn)
obs_int_mod2_log_poly2_aic=step(obs_int_mod2_log_poly_2,direction="both",trace=0)
obs_int_mod2_log_poly2_bic=step(obs_int_mod2_log_poly_2,direction="both",trace=0,k=log(nrow(obs_trn)))
p1=mod_performance(obs_int_mod2_log_poly2_aic)
p2=mod_performance(obs_int_mod2_log_poly2_bic)
p1=append(p1,unlist(list(num_predictors=length(coef(obs_int_mod2_log_poly2_aic)))))
p2=append(p2,unlist(list(num_predictors=length(coef(obs_int_mod2_log_poly2_bic)))))
mod2_stats=data.frame(obs_int_mod2_log_poly2_aic=p1,obs_int_mod2_log_poly2_bic=p2)
kable(mod2_stats)
Note: we omitted these transformed models from our final decision, as they added significant execution time and did not yield worthwhile results. They showed only minor differences in performance metrics while adding complexity, which was not desirable in our final model.